
Reviews: Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

Neural Information Processing Systems

The paper proposes to use episodic backward updates to improve data efficiency in RL tasks; furthermore, the authors introduce a soft relaxation of this scheme to combat the overestimation that typically arises when backward updates are combined with neural network models. Overall, the paper is very clearly written. My main concerns are with the experimental details and the literature review; moreover, when the existing literature is taken into account, the novelty of the work is quite limited. The idea of using backward updates is quite old and goes back to at least the 1993 paper "Prioritized Sweeping" by Moore and Atkeson, which in fact demonstrates a method very similar to what the authors propose and which the authors fail to cite. Furthermore, several recent papers operate in a similar space of ideas, using a backward view in ways similar to the authors', e.g. "Fast deep reinforcement learning using online adjustments from the past" (https://arxiv.org/abs/1810.08163).


Reviews: Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

Neural Information Processing Systems

All reviewers recommend accepting the paper. The authors' response addressed most of the reviewers' concerns. While the AC recommends accepting the paper, the AC encourages the authors to consider the comments of reviewer 1. Changing only the backup mechanism while keeping all other hyperparameters fixed as in the Nature DQN model is indeed a good experimental setup. However, the optimal operating regime for different models might differ (even when they share architectures and training protocols): for instance, we could 'afford' a larger learning rate if we have a better backup mechanism.


Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

Neural Information Processing Systems

We propose Episodic Backward Update (EBU), a novel deep reinforcement learning algorithm with direct value propagation. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. In particular, on 49 games of the Atari 2600 domain, EBU achieves the same mean and median human-normalized performance as DQN using only 5% and 10% of the samples, respectively.
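The backward value propagation described in the abstract can be illustrated with a tabular sketch. This is a hypothetical simplification, not the paper's exact deep-RL recursion: the diffusion-style mixing factor `beta` and the form of the mixed target are assumptions made for illustration.

```python
import numpy as np

def episodic_backward_update(Q, episode, alpha=0.1, gamma=0.99, beta=0.5):
    """Tabular sketch of an episodic backward update.

    Q: dict mapping state -> np.array of action values.
    episode: list of (state, action, reward, next_state) transitions in
    time order. beta is a diffusion factor (an assumption here) mixing
    the freshly propagated target with the stored bootstrap estimate.
    """
    target = 0.0  # value propagated backward from the end of the episode
    for (s, a, r, s_next) in reversed(episode):
        # One-step bootstrap from the stored table (0 for unseen/terminal states).
        bootstrap = np.max(Q[s_next]) if s_next in Q else 0.0
        # Mix the backward-propagated return with the usual one-step target.
        target = r + gamma * (beta * target + (1.0 - beta) * bootstrap)
        Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

With `beta = 1` a single sweep carries a terminal reward all the way back to the first transition of the episode, which is the intuition behind the sample-efficiency gains on sparse-reward tasks.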


Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

Lee, Su Young, Choi, Sungik, Chung, Sae-Young

Neural Information Processing Systems



Sample-efficient Deep Reinforcement Learning for Dialog Control

Asadi, Kavosh, Williams, Jason D.

arXiv.org Machine Learning

Representing a dialog policy as a recurrent neural network (RNN) is attractive because it handles partial observability, infers a latent representation of state, and can be optimized with supervised learning (SL) or reinforcement learning (RL). For RL, a policy gradient approach is natural but sample-inefficient. In this paper, we present three methods for reducing the number of dialogs required to optimize an RNN-based dialog policy with RL. The key idea is to maintain a second RNN that predicts the value of the current policy, and to apply experience replay to both networks. On two tasks, these methods reduce the number of dialogs/episodes required by about a third compared with standard policy gradient methods.
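The key idea above, a separate value model serving as a baseline with experience replay feeding both networks, can be sketched in a toy form. Linear models stand in for the paper's RNNs, importance weighting for off-policy replay is omitted for brevity, and the class name `ReplayPG` and its update rules are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayPG:
    """Toy sketch: a policy model and a separate value (baseline)
    model, both updated from a shared replay buffer of past episodes.
    Linear models replace the RNNs of the paper (a simplification)."""

    def __init__(self, n_features, n_actions, lr=0.05):
        self.theta = np.zeros((n_features, n_actions))  # policy weights
        self.w = np.zeros(n_features)                   # value weights
        self.lr = lr
        self.buffer = []                                # stored episodes

    def policy(self, x):
        logits = x @ self.theta
        p = np.exp(logits - logits.max())               # stable softmax
        return p / p.sum()

    def store(self, episode):
        # episode: list of (features, action, return-to-go) tuples
        self.buffer.append(episode)

    def replay_update(self, batch=4):
        # Sample past episodes and update BOTH networks from them.
        idx = rng.integers(0, len(self.buffer),
                           size=min(batch, len(self.buffer)))
        for i in idx:
            for (x, a, G) in self.buffer[i]:
                adv = G - x @ self.w                  # advantage vs. baseline
                self.w += self.lr * adv * x           # value regression step
                p = self.policy(x)
                grad = -np.outer(x, p)                # d log pi(a|x) / d theta
                grad[:, a] += x
                self.theta += self.lr * adv * grad    # policy gradient step
```

After storing an episode where action 0 earned a positive return, a replay update raises both the baseline estimate and the probability of action 0, showing how the same replayed data trains the two models together.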